Introduction

About Dataset (information obtained at the original article mentioned at source)

The original authors from the project collected students’ information of three different educational levels:
1. School,
2. College,
3. University.

The data collection was conducted by online and physical surveys. Each one of the surveys form consisted of an individual’s socio-demographic factors. The data collected was 1205 data from the time period from December 10th, 2020 to February 5th, 2021 and the 14 attributes collected were:


1. Gender: Gender type of student
2. Age: Age range of the student
3. Education Level: Education institution level
4. Institution Type: Education institution type
5. IT Student: Studying as IT student or not
6. Location: whether student is located in town or not
7. Load-shedding: Level of load shedding
8. Financial Condition: Financial condition of family
9. Internet Type: Internet type used mostly in device
10. Network Type: Network connectivity type
11. Class Duration: Daily class duration
12. Self LMS: Institution’s own LMS availability
13. Device: Device used mostly in class
14. Adaptability Level: Adaptability level of the student

Goal of the notebook:

Predict the outcome of Students’ Adaptability Level Prediction in Online Education using Machine Learning Approaches. Our target column is (14) Adaptability Level: Adaptability level of the student.Our goal include 3 main parts below:


1) Univariate & Multivariate Analysis on features;


2) Data Preprocessing for ML MCA Prediction;


3) ML Model Generation - MCA

Table of Contents


1. R setup
2. Knowing my data
3. Univariate Analysis
4. Multivariate Analysis
5. Preparing the data to MCA
6. Contingency tables
7. Create MCA

1. R Steup


At this step we will install or import libraries that are necessary to this project

pack <- c("plotly", 
             "tidyverse", 
             "ggrepel",
             "knitr", "kableExtra", 
             "sjPlot", 
             "FactoMineR", 
             "amap", 
             "ade4",
             "readxl",
          "viridis")

if(sum(as.numeric(!pack%in%installed.packages()))!= 0){
  installing <- pacotes[!pack %in% installed.packages()]
  for(i in 1:length(installing)) {
    install.packages(installing, dependencies = T)
    break()}
  sapply(pack, require, character = T) 
} else {
  sapply(pack, require, character = T) 
}
##     plotly  tidyverse    ggrepel      knitr kableExtra     sjPlot FactoMineR 
##       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
##       amap       ade4     readxl    viridis 
##       TRUE       TRUE       TRUE       TRUE


The next step inside our setup is to load the data. The data was stored at .csv format and in the same folder where this script is saved.

 original_data <- read.csv("students_adaptability_level_online_education.csv")

2. Knowing My Data

At this stage I would like to know:
Column names and its position identification
Type of variables that were registered (as we will work with AC we need to identify if conversion is needed)
There is any NA and where
What are the unique values

#Column names and its position identification
  for (c in 1:14){
    print (paste(c,"","is","",colnames(original_data)[c]))
  }
## [1] "1  is  Gender"
## [1] "2  is  Age"
## [1] "3  is  Education.Level"
## [1] "4  is  Institution.Type"
## [1] "5  is  IT.Student"
## [1] "6  is  Location"
## [1] "7  is  Load.shedding"
## [1] "8  is  Financial.Condition"
## [1] "9  is  Internet.Type"
## [1] "10  is  Network.Type"
## [1] "11  is  Class.Duration"
## [1] "12  is  Self.Lms"
## [1] "13  is  Device"
## [1] "14  is  Adaptivity.Level"
# For variable types, easy way
summary(original_data)
##     Gender              Age            Education.Level    Institution.Type  
##  Length:1205        Length:1205        Length:1205        Length:1205       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##   IT.Student          Location         Load.shedding      Financial.Condition
##  Length:1205        Length:1205        Length:1205        Length:1205        
##  Class :character   Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   
##  Internet.Type      Network.Type       Class.Duration       Self.Lms        
##  Length:1205        Length:1205        Length:1205        Length:1205       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##     Device          Adaptivity.Level  
##  Length:1205        Length:1205       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
# There is a NA?
for (c in colnames(original_data)){
  for (i in 1:nrow(original_data)){
    if (is.na(original_data[i, c])){
      print(paste("There is a NA at",i,c))
      break
    }else{
      next
    }
  }
}

There is no NA on the data set.

#Unique values
for (c in colnames(original_data)){
 print(paste("To column ",c))
 print(paste(unique(original_data[[c]])))
}
## [1] "To column  Gender"
## [1] "Boy"  "Girl"
## [1] "To column  Age"
## [1] "21-25" "16-20" "11-15" "26-30" "6-10"  "1-5"  
## [1] "To column  Education.Level"
## [1] "University" "College"    "School"    
## [1] "To column  Institution.Type"
## [1] "Non Government" "Government"    
## [1] "To column  IT.Student"
## [1] "No"  "Yes"
## [1] "To column  Location"
## [1] "Yes" "No" 
## [1] "To column  Load.shedding"
## [1] "Low"  "High"
## [1] "To column  Financial.Condition"
## [1] "Mid"  "Poor" "Rich"
## [1] "To column  Internet.Type"
## [1] "Wifi"        "Mobile Data"
## [1] "To column  Network.Type"
## [1] "4G" "3G" "2G"
## [1] "To column  Class.Duration"
## [1] "3-6" "1-3" "0"  
## [1] "To column  Self.Lms"
## [1] "No"  "Yes"
## [1] "To column  Device"
## [1] "Tab"      "Mobile"   "Computer"
## [1] "To column  Adaptivity.Level"
## [1] "Moderate" "Low"      "High"

3. Univariate Analysis

Gender and Institution.Type
slices_gender <- c(sum(original_data[1]=='Girl'),sum(original_data[1]=='Boy'))
pcts <- round((slices_gender/sum(slices_gender)*100))
lbls <- c("Girl","Boy")
lbls <- paste(lbls, pcts) # add percents to labels
lbls <- paste(lbls,"%",sep="") # ad % to labels
pie(slices_gender,labels = lbls, main="Gender Distribution\n(Gender type of student)")

slices_inst <- c(sum(original_data[4]=='Non Government'),sum(original_data[4]=='Government'))
pcs <- round((slices_inst/sum(slices_inst)*100))
lbls <- c("Non Government","Government")
lbls <- paste(lbls, pcs) # add percents to labels
lbls <- paste(lbls,"%",sep="") # ad % to labels
pie(slices_inst,labels = lbls,  col=rainbow(length(lbls)), main="Education institution\n(Management Type)")


From this Univariate analysis is possible to conclude that:
a)Male accounts for 55% and Female separately accounts for 45%;
b)Regarding the institution type is possible to verify that 68,3% of the data set were made by students from Non Government managed institutions.

Age & Class Duration
bar_data <- c(sum(original_data[2] == "21-25"),sum(original_data[2] =="16-20"),sum(original_data[2] == "11-15"),sum(original_data[2] == "26-30"), sum(original_data[2] == "6-10"), sum(original_data[2] ==  "1-5"))
lbsB <- c("21-25", "16-20", "11-15", "26-30", "6-10","1-5")
barplot(bar_data,names.arg=lbsB,xlab="Age",ylab="Count",col=viridis(length(lbsB)),
main="Age chart",border="white")

bar_data_class <- c(sum(original_data[11] == "3-6"),sum(original_data[11] =="1-3"),sum(original_data[11] == "0"))
lbsC <- c("3-6", "1-3", "0")
barplot(bar_data_class,names.arg=lbsC,xlab="Duration",ylab="Count",col=viridis(length(lbsC)),
main="Class Duration",border="white")


From this Univariate analysis is possible to conclude that:
a) Age of the respondents is mainly distributed between 11 and 25;
b)and the Class Duration mainly distributed between 1-3 hours.

Education Level & Financial Condition & Network Type
slices_ed_lev <- c(sum(original_data[3]=='College'),sum(original_data[3]=='3G'),sum(original_data[3]=='School'))
pcsEd <- round((slices_ed_lev/sum(slices_ed_lev)*100))
lblsEd <- c("College","University","School")
lblsEd <- paste(lblsEd, pcsEd,"%", sep = " ") # add percents to labels
pie(slices_ed_lev,labels = lblsEd,  col=rainbow(length(lblsEd)), main="Education Level")

slices_fin <- c(sum(original_data[8]== "Mid" ),sum(original_data[8]==  "Poor" ),sum(original_data[8]== "Rich"))
pcsF <- round((slices_fin/sum(slices_fin)*100))
lblsF <- c("Mid","Poor","Rich")
lblsF <- paste(lblsF, pcsF,"%", sep = " ") # add percents to labels
pie(slices_fin,labels = lblsF,  col=rainbow(length(lblsF)), main="Financial Condition")

slices_NT <- c(sum(original_data[10]=='4G'),sum(original_data[10]=='3G'),sum(original_data[10]=='2G'))
pcsNT <- round((slices_NT/sum(slices_NT)*100))
lblsNT <- c("4G","3G","2G")
lblsNT <- paste(lblsNT, pcsNT,"%", sep = " ") # add percents to labels
pie(slices_NT,labels = lblsNT,  col=rainbow(length(lblsNT)), main="Network Type")


From this Univariate analysis is possible to conclude that most of the respondents:
a) Attend School and University
b) Are situated in Mid financial condition
c)Make use of 4G connection. Also is important to note that, there are just 1.58% of students using 2G network.

IT Student & Location & Self Lms
slices_IT <- c(sum(original_data[5]== "Yes" ),sum(original_data[5]==  "No" ))
pcsIT <- round((slices_IT/sum(slices_IT)*100))
lblsIT <- c("Yes","No")
lblsIT <- paste(lblsIT, pcsIT,"%", sep = " ") # add percents to labels
pie(slices_IT,labels = lblsIT,  col=rainbow(length(lblsIT)), main= "IT student")

slices_Loc <- c(sum(original_data[6]== "Yes" ),sum(original_data[6]==  "No" ))
pcsLoc <- round((slices_Loc/sum(slices_Loc)*100))
lblsLoc <- c("Yes","No")
lblsLoc <- paste(lblsLoc, pcsLoc,"%", sep = " ") # add percents to labels
pie(slices_Loc,labels = lblsLoc,  col=rainbow(length(lblsLoc)), main= "Location")

slices_SLm <- c(sum(original_data[12]== "Yes" ),sum(original_data[12]==  "No" ))
pcsSLm <- round((slices_SLm/sum(slices_SLm)*100))
lblsSLm <- c("Yes","No")
lblsSLm <- paste(lblsSLm, pcsSLm,"%", sep = " ") # add percents to labels
pie(slices_SLm,labels = lblsSLm,  col=rainbow(length(lblsSLm)), main= "Self LMS")


From this Univariate analysis is possible to conclude that:
a) Around 25.2% of respondents are IT student
b) 77.6% of them are located in town
c)17.4% of their institutions own LMS availability

Load-shedding & Internet Type & Device
slices_L_s <- c(sum(original_data[7]== "High" ),sum(original_data[7]==  "Low" ))
pcs_L_s <- round((slices_L_s/sum(slices_L_s)*100))
lbls_L_s <- c("Low","High")
lbls_L_s <- paste(lbls_L_s, pcs_L_s,"%", sep = " ") # add percents to labels
pie(slices_L_s,labels = lbls_L_s,  col=rainbow(length(lbls_L_s)), main= "Load-shedding")

slices_Inter_typ <- c(sum(original_data[9]== "Wifi" ),sum(original_data[9]==  "Mobile Data" ))
pcsInter_typ <- round((slices_Inter_typ/sum(slices_Inter_typ)*100))
lblsInter_typ <- c("Wifi","Mobile Data")
lblsInter_typ <- paste(lblsInter_typ, pcsInter_typ,"%", sep = " ") # add percents to labels
pie(slices_Inter_typ,labels = lblsInter_typ,  col=rainbow(length(lblsInter_typ)), main= "Internet type")

slices_D <- c(sum(original_data[13]== "Tab" ),sum(original_data[13]== "Mobile" ),sum(original_data[13]== "Computer" ))
pcsD <- round((slices_D/sum(slices_D)*100))
lblsD <- c("Tab","Mobile", "Computer")
lblsD <- paste(lblsD, pcsD,"%", sep = " ") # add percents to labels
pie(slices_D,labels = lblsD,  col=rainbow(length(lblsD)), main= "Device type")


From this Univariate analysis is possible to conclude that:
a) Load-shedding refers to level of load shedding, and type ‘low’ accounts for 83.3%
b) 57.7% of respondents use mobile data to take classes
c) 84.1% of the respondents use mobile.

Adaptivity Level (Target Column)
slices_Ad <- c(sum(original_data[14]== "Moderate" ),sum(original_data[14]== "Low" ),sum(original_data[14]== "High" ))
pcsAd <- round((slices_Ad/sum(slices_Ad)*100))
lblsAd <- c("Moderate","Low","High")
lblsAd <- paste(lblsAd, pcsAd,"%", sep = " ") # add percents to labels
pie(slices_Ad,labels = lblsAd,  col=rainbow(length(lblsAd)), main= "Adaptivity Level")


Adaptability level refers to adaptability level of the student during online education. It can be seen that about 52% respond that they have moderate adaptability, while low accounts for 40%, high responds for 8% for this sample.

4. Multivariate Analysis

Age Distribution by Gender
tt <- table(original_data[,1:2]) # Create a contingency table
df_tt <- data.frame(expand.grid(row.names(tt),colnames(tt)),c(tt)) # transofrm into a data frame
colnames(df_tt) <- c("Gender","Age","Count.t")
  ggplot(df_tt, aes(fill = Gender, y = Count.t,x = Age))+
    geom_bar(position="dodge", stat="identity")+
    geom_text(aes(x = Age, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
            size = 2.5,
            position = position_dodge(.9),
            inherit.aes = TRUE,
            na.rm = TRUE, vjust = -1)+
    ggtitle("Gender and Age Distribution")


The distribution of Age per gender is:
a) Female’s ditributed mainly in between 11-15 and 15-20;
b) Male ditributed mainly in between 11-15 and 21-25.

IT Student Distribution by Gender
tt_2 <- table(original_data[,c(1,5)]) # Create a contingency table
df_tt_2 <- data.frame(expand.grid(row.names(tt_2),colnames(tt_2)),c(tt_2))
colnames(df_tt_2) <- c("Gender","IT","Count.t")
  ggplot(df_tt_2, aes(fill = Gender, y = Count.t,x = IT))+
    geom_bar(position="dodge", stat="identity")+
    geom_text(aes(x = IT, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
            size = 2.5,
            position = position_dodge(.9),
            inherit.aes = TRUE,
            na.rm = TRUE, vjust = -1)+
    ggtitle("Gender and IT studies")


Is possible to notice that the number of males studying IT compared to female, for this survey, is higher.

IT Student Distribution by Education
tt_3 <- table(original_data[,c(3,5)]) # Create a contingency table
df_tt_3 <- data.frame(expand.grid(row.names(tt_3),colnames(tt_3)),c(tt_3))
colnames(df_tt_3) <- c("Education.Level","IT","Count.t")
  ggplot(df_tt_3, aes(fill = Education.Level, y = Count.t,x = IT))+
    geom_bar(position="dodge", stat="identity")+
    geom_text(aes(x = IT, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
            size = 2.5,
            position = position_dodge(.9),
            inherit.aes = TRUE,
            na.rm = TRUE, vjust = -1)+
    ggtitle("Education Level and IT studies")


The got the most IT students, whereas 30 and 27 IT students in college and school.

Gender Distribution by Adaptivity Level
tt_4 <- table(original_data[,c(1,14)]) # Create a contingency table
df_tt_4 <- data.frame(expand.grid(row.names(tt_4),colnames(tt_4)),c(tt_4))
colnames(df_tt_4) <- c("Gender","Adaptability","Count.t")
  ggplot(df_tt_4, aes(fill = Gender, y = Count.t,x = Adaptability))+
    geom_bar(position="dodge", stat="identity")+
    geom_text(aes(x = Adaptability, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
            size = 2.5,
            position = position_dodge(.9),
            inherit.aes = TRUE,
            na.rm = TRUE, vjust = -1)+
    ggtitle("Gender and Adaptivity Level")


For this survey, 71 male respondents answered as they have high adaptivity level in online education compared to 29 female.
Additionally, both male and female overall got moderate adaptivity level.

Age Distribution by Adaptivity Level
tt_5 <- table(original_data[,c(2,14)]) # Create a contingency table
df_tt_5 <- data.frame(expand.grid(row.names(tt_5),colnames(tt_5)),c(tt_5))
colnames(df_tt_5) <- c("Age","Adaptability","Count.t")
  ggplot(df_tt_5, aes(fill = Age, y = Count.t,x = Adaptability))+
    geom_bar(position="dodge", stat="identity")+
    geom_text(aes(x = Adaptability, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
            size = 2.5,
            position = position_dodge(.9),
            inherit.aes = TRUE,
            na.rm = TRUE, vjust = -1)+
    ggtitle("Age Distribution by Adaptivity Level")


a) 11-15, 21-25 and 26-30 high adaptivity levelif compared with the other ages. None of 1-5 got high adataptivity.

Gender and Age distribution by adaptability
fem <- original_data%>%
  filter(Gender == "Girl")
tt_6 <- table(fem[,c(2,14)]) # Create a contingency table
tt_6
##        Adaptivity.Level
## Age     High Low Moderate
##   1-5      0   3       64
##   11-15    5  46       91
##   16-20    5 104       60
##   21-25    6  45       58
##   26-30    0  26        0
##   6-10    13  11        5
df_tt_6 <- data.frame(expand.grid(row.names(tt_6),colnames(tt_6)),c(tt_6))
colnames(df_tt_6) <- c("Age","Adaptability","Count.t")
  ggplot(df_tt_6, aes(fill = Adaptability, y = Count.t,x = Age))+
    geom_bar(position="dodge", stat="identity")+
    geom_text(aes(x = Age, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
            size = 2.5,
            position = position_dodge(.9),
            inherit.aes = TRUE,
            na.rm = TRUE, vjust = -1)+
    ggtitle("Age Distribution by Adaptivity Level for Girls")


From this chart is possible observe that, for the Female gender respondents for this survey:
a) Between the ages 26-30 all respondents answer as having Low adaptivity
b) Bigger number of responses related to the Moderate adaptativity were observed at the ages between 1-5,11-5 and 21-25
c) A slightly higher adaptivity was noticed at the age range of 6-10, being just to answers higher is not possible afirm anything.

male <- original_data%>%
  filter(Gender == "Boy")
tt_7 <- table(male[,c(2,14)]) # Create a contingency table
tt_7
##        Adaptivity.Level
## Age     High Low Moderate
##   1-5      0  14        0
##   11-15   23  74      114
##   16-20    0  40       69
##   21-25   32  94      139
##   26-30   12  10       20
##   6-10     4  13        5
df_tt_7 <- data.frame(expand.grid(row.names(tt_7),colnames(tt_7)),c(tt_7))
colnames(df_tt_7) <- c("Age","Adaptability","Count.t")
  ggplot(df_tt_7, aes(fill = Adaptability, y = Count.t,x = Age))+
    geom_bar(position="dodge", stat="identity")+
    geom_text(aes(x = Age, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
            size = 2.5,
            position = position_dodge(.9),
            inherit.aes = TRUE,
            na.rm = TRUE, vjust = -1)+
    ggtitle("Age Distribution by Adaptivity Level for Boys")


From this chart is possible observe that, for the Male gender respondents for this survey:
a) Between the ages 26-30 majority of the respondents answered as having Moderate adaptivity
b) Bigger number of responses related to the Moderate adaptativity were observed at the ages range between 11-15, 16-20,21-25 and 26-30
c) For the age range of 1-5 all respondents answered Low
d) And for 6-10 majority of responses were low.

5. Preparing the data to ACM

To input the data at MCA formula is needed to convert the data for factor format when applying dud.acm().

# Data convert to factor
original_data_factor <- as.data.frame(unclass(original_data), stringsAsFactors = TRUE)

The next step here is to extract the observed frequencies, we do it by summary().

summary(original_data_factor)
##   Gender       Age        Education.Level       Institution.Type IT.Student
##  Boy :663   1-5  : 81   College   :219    Government    :382     No :901   
##  Girl:542   11-15:353   School    :530    Non Government:823     Yes:304   
##             16-20:278   University:456                                     
##             21-25:374                                                      
##             26-30: 68                                                      
##             6-10 : 51                                                      
##  Location  Load.shedding Financial.Condition     Internet.Type Network.Type
##  No :270   High: 201     Mid :878            Mobile Data:695   2G: 19      
##  Yes:935   Low :1004     Poor:242            Wifi       :510   3G:411      
##                          Rich: 85                              4G:775      
##                                                                            
##                                                                            
##                                                                            
##  Class.Duration Self.Lms       Device     Adaptivity.Level
##  0  :154        No :995   Computer: 162   High    :100    
##  1-3:840        Yes:210   Mobile  :1013   Low     :480    
##  3-6:211                  Tab     :  30   Moderate:625    
##                                                           
##                                                           
## 

6. Contingency tables

We want to test if exist correlation between Adaptivity.Level (referencial attribute) and the other attributes. So Adaptivity.Level will be our row, and we need compare with each one of the others attributes. As output we will have 4 tables where:
1.Numbers displayed in blue are row percentage for observed freq.
2.Numbers displayed in green are col. percentage for observed freq.
3.Numbers displayed in black are observed freq.
4.Numbers displayed in dark green are explained freq.


The hypothesis to be tested are:


H0: The association between the two categorical variables is random.


H1: The association between the two categorical variables is not random.



The Variables to have its association tested are: Gender; Age; Education.Level; Institution.Type; IT.Student; Location; Load.shedding; Financial.Condition; Internet.Type; Network.Type; Class.Duration; Self.Lms; Device;

sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$Device,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level Device Total
Computer Mobile Tab
High 30
13
30 %
18.5 %
68
84
68 %
6.7 %
2
2
2 %
6.7 %
100
100
100 %
8.3 %
Low 40
65
8.3 %
24.7 %
438
404
91.2 %
43.2 %
2
12
0.4 %
6.7 %
480
480
100 %
39.8 %
Moderate 92
84
14.7 %
56.8 %
507
525
81.1 %
50 %
26
16
4.2 %
86.7 %
625
625
100 %
51.9 %
Total 162
162
13.4 %
100 %
1013
1013
84.1 %
100 %
30
30
2.5 %
100 %
1205
1205
100 %
100 %
χ2=52.519 · df=4 · Cramer’s V=0.148 · Fisher’s p=0.000
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$Self.Lms,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level Self.Lms Total
No Yes
High 70
83
70 %
7 %
30
17
30 %
14.3 %
100
100
100 %
8.3 %
Low 428
396
89.2 %
43 %
52
84
10.8 %
24.8 %
480
480
100 %
39.8 %
Moderate 497
516
79.5 %
49.9 %
128
109
20.5 %
61 %
625
625
100 %
51.9 %
Total 995
995
82.6 %
100 %
210
210
17.4 %
100 %
1205
1205
100 %
100 %
χ2=29.535 · df=2 · Cramer’s V=0.157 · p=0.000
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$Class.Duration,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level Class.Duration Total
0 1-3 3-6
High 0
13
0 %
0 %
82
70
82 %
9.8 %
18
18
18 %
8.5 %
100
100
100 %
8.3 %
Low 144
61
30 %
93.5 %
290
335
60.4 %
34.5 %
46
84
9.6 %
21.8 %
480
480
100 %
39.8 %
Moderate 10
80
1.6 %
6.5 %
468
436
74.9 %
55.7 %
147
109
23.5 %
69.7 %
625
625
100 %
51.9 %
Total 154
154
12.8 %
100 %
840
840
69.7 %
100 %
211
211
17.5 %
100 %
1205
1205
100 %
100 %
χ2=225.918 · df=4 · Cramer’s V=0.306 · p=0.000
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$Network.Type,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level Network.Type Total
2G 3G 4G
High 0
2
0 %
0 %
22
34
22 %
5.4 %
78
64
78 %
10.1 %
100
100
100 %
8.3 %
Low 16
8
3.3 %
84.2 %
186
164
38.8 %
45.3 %
278
309
57.9 %
35.9 %
480
480
100 %
39.8 %
Moderate 3
10
0.5 %
15.8 %
203
213
32.5 %
49.4 %
419
402
67 %
54.1 %
625
625
100 %
51.9 %
Total 19
19
1.6 %
100 %
411
411
34.1 %
100 %
775
775
64.3 %
100 %
1205
1205
100 %
100 %
χ2=30.243 · df=4 · Cramer’s V=0.112 · Fisher’s p=0.000
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$Load.shedding,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level Load.shedding Total
High Low
High 13
17
13 %
6.5 %
87
83
87 %
8.7 %
100
100
100 %
8.3 %
Low 100
80
20.8 %
49.8 %
380
400
79.2 %
37.8 %
480
480
100 %
39.8 %
Moderate 88
104
14.1 %
43.8 %
537
521
85.9 %
53.5 %
625
625
100 %
51.9 %
Total 201
201
16.7 %
100 %
1004
1004
83.3 %
100 %
1205
1205
100 %
100 %
χ2=9.972 · df=2 · Cramer’s V=0.091 · p=0.007
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$Location,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level Location Total
No Yes
High 8
22
8 %
3 %
92
78
92 %
9.8 %
100
100
100 %
8.3 %
Low 171
108
35.6 %
63.3 %
309
372
64.4 %
33 %
480
480
100 %
39.8 %
Moderate 91
140
14.6 %
33.7 %
534
485
85.4 %
57.1 %
625
625
100 %
51.9 %
Total 270
270
22.4 %
100 %
935
935
77.6 %
100 %
1205
1205
100 %
100 %
χ2=82.310 · df=2 · Cramer’s V=0.261 · p=0.000
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$IT.Student,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level IT.Student Total
No Yes
High 67
75
67 %
7.4 %
33
25
33 %
10.9 %
100
100
100 %
8.3 %
Low 391
359
81.5 %
43.4 %
89
121
18.5 %
29.3 %
480
480
100 %
39.8 %
Moderate 443
467
70.9 %
49.2 %
182
158
29.1 %
59.9 %
625
625
100 %
51.9 %
Total 901
901
74.8 %
100 %
304
304
25.2 %
100 %
1205
1205
100 %
100 %
χ2=19.597 · df=2 · Cramer’s V=0.128 · p=0.000
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$Age,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level Age Total
1-5 11-15 16-20 21-25 26-30 6-10
High 0
7
0 %
0 %
28
29
28 %
7.9 %
5
23
5 %
1.8 %
38
31
38 %
10.2 %
12
6
12 %
17.6 %
17
4
17 %
33.3 %
100
100
100 %
8.3 %
Low 17
32
3.5 %
21 %
120
141
25 %
34 %
144
111
30 %
51.8 %
139
149
29 %
37.2 %
36
27
7.5 %
52.9 %
24
20
5 %
47.1 %
480
480
100 %
39.8 %
Moderate 64
42
10.2 %
79 %
205
183
32.8 %
58.1 %
129
144
20.6 %
46.4 %
197
194
31.5 %
52.7 %
20
35
3.2 %
29.4 %
10
26
1.6 %
19.6 %
625
625
100 %
51.9 %
Total 81
81
6.7 %
100 %
353
353
29.3 %
100 %
278
278
23.1 %
100 %
374
374
31 %
100 %
68
68
5.6 %
100 %
51
51
4.2 %
100 %
1205
1205
100 %
100 %
χ2=125.296 · df=10 · Cramer’s V=0.228 · Fisher’s p=0.000
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$Gender,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level Gender Total
Boy Girl
High 71
55
71 %
10.7 %
29
45
29 %
5.4 %
100
100
100 %
8.3 %
Low 245
264
51 %
37 %
235
216
49 %
43.4 %
480
480
100 %
39.8 %
Moderate 347
344
55.5 %
52.3 %
278
281
44.5 %
51.3 %
625
625
100 %
51.9 %
Total 663
663
55 %
100 %
542
542
45 %
100 %
1205
1205
100 %
100 %
χ2=13.451 · df=2 · Cramer’s V=0.106 · p=0.001
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$Education.Level,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level Education.Level Total
College School University
High 3
18
3 %
1.4 %
47
44
47 %
8.9 %
50
38
50 %
11 %
100
100
100 %
8.3 %
Low 120
87
25 %
54.8 %
182
211
37.9 %
34.3 %
178
182
37.1 %
39 %
480
480
100 %
39.8 %
Moderate 96
114
15.4 %
43.8 %
301
275
48.2 %
56.8 %
228
237
36.5 %
50 %
625
625
100 %
51.9 %
Total 219
219
18.2 %
100 %
530
530
44 %
100 %
456
456
37.8 %
100 %
1205
1205
100 %
100 %
χ2=38.686 · df=4 · Cramer’s V=0.127 · p=0.000
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$Institution.Type,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level Institution.Type Total
Government Non Government
High 20
32
20 %
5.2 %
80
68
80 %
9.7 %
100
100
100 %
8.3 %
Low 234
152
48.8 %
61.3 %
246
328
51.2 %
29.9 %
480
480
100 %
39.8 %
Moderate 128
198
20.5 %
33.5 %
497
427
79.5 %
60.4 %
625
625
100 %
51.9 %
Total 382
382
31.7 %
100 %
823
823
68.3 %
100 %
1205
1205
100 %
100 %
χ2=107.108 · df=2 · Cramer’s V=0.298 · p=0.000
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$Internet.Type,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level Internet.Type Total
Mobile Data Wifi
High 36
58
36 %
5.2 %
64
42
64 %
12.5 %
100
100
100 %
8.3 %
Low 288
277
60 %
41.4 %
192
203
40 %
37.6 %
480
480
100 %
39.8 %
Moderate 371
360
59.4 %
53.4 %
254
265
40.6 %
49.8 %
625
625
100 %
51.9 %
Total 695
695
57.7 %
100 %
510
510
42.3 %
100 %
1205
1205
100 %
100 %
χ2=21.036 · df=2 · Cramer’s V=0.132 · p=0.000
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
         var.col = original_data_factor$Financial.Condition,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")
Adaptivity.Level Financial.Condition Total
Mid Poor Rich
High 36
73
36 %
4.1 %
22
20
22 %
9.1 %
42
7
42 %
49.4 %
100
100
100 %
8.3 %
Low 341
350
71 %
38.8 %
129
96
26.9 %
53.3 %
10
34
2.1 %
11.8 %
480
480
100 %
39.8 %
Moderate 501
455
80.2 %
57.1 %
91
126
14.6 %
37.6 %
33
44
5.3 %
38.8 %
625
625
100 %
51.9 %
Total 878
878
72.9 %
100 %
242
242
20.1 %
100 %
85
85
7.1 %
100 %
1205
1205
100 %
100 %
χ2=236.865 · df=4 · Cramer’s V=0.314 · p=0.000

Looking at all p-values, is possible to reject H0, so is possible to say that there is a not random association between all the categorical variables and the Adaptivity.Level.

Thus for the MCA we will keep all the attributes

7. Creating MCA

MCA <- dudi.acm(original_data_factor, scannf = FALSE, nf = 3)
##nf = 3 means that the coordinates extracted are referent to 3 dimensions (the 3 that have bigger part. on the variance), I choose it because I want to plot a 3D graph.
Percentual Variances for Each Dimension(Eigenvalues)
var_perc <- MCA$eig/sum(MCA$eig)*100
paste0(round(var_perc,2),"%")
##  [1] "14.6%"  "10.91%" "8.48%"  "7.45%"  "5.78%"  "5.59%"  "5.11%"  "4.24%" 
##  [9] "4.2%"   "3.88%"  "3.69%"  "3.26%"  "3.15%"  "2.77%"  "2.48%"  "2.34%" 
## [17] "2.27%"  "2.05%"  "1.97%"  "1.81%"  "1.45%"  "1.36%"  "0.96%"  "0.19%"

Were generated 24 dimensions because the maximal number of dimensions is given by number of attributes(J = 38) subtracted by the number of variables(Q = 14).
##### Number of attributes per variable

numb_attr <- apply(original_data_factor,  
                   MARGIN = 2,
                   FUN = function(x) nlevels(as.factor(x)))
#R uses the number of attributes to generate an data frame with the coordinates of the Binary or Burt Matrix 
Consolidating the coordinates

Binary Matriz Method

df_MCA <- data.frame(MCA$c1, Variable = rep(names(numb_attr), numb_attr))

Ploting the Perceptual Map

df_MCA %>%
  rownames_to_column() %>%
  rename(Attribute = 1)%>%
  ggplot(aes(x = CS1, y = CS2, label = Attribute, color = Variable))+
  geom_point()+
  geom_label_repel()+
  geom_vline(aes(xintercept = 0), linetype = "longdash", color = "grey48") +
  geom_hline(aes(yintercept = 0), linetype = "longdash", color = "grey48") +
  labs(x = paste("Dimension 1:", paste0(round(var_perc[1], 2), "%")),
       y = paste("Dimension 2:", paste0(round(var_perc[2], 2), "%"))) +
  theme_bw()
## Warning: ggrepel: 12 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

MCA_3D <- plot_ly()
MCA_3D <- add_trace(p = MCA_3D,
                    x = df_MCA$CS1,
                    y = df_MCA$CS2,
                    z = df_MCA$CS3,
                    mode = "text",
                    text = rownames(df_MCA),
                    textfont = list(color = "blue"),
                    marker = list(color = "red"),
                    showlegend = FALSE)
MCA_3D
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## A marker object has been specified, but markers is not in the mode
## Adding markers to the mode...

Results


Exists an association between:
a) Adaptivity.Level Low and Institution.Type.Government, Network.Type 2G, Age 26-30, Load.shedding High, Device Mobile, Location.No (out of Town), Gender.Girl.
b)Adaptivity.Level Low and Institution.Type.Non Government, Network.Type 4G, Class.Duration 1.3h, Load.shedding Low, Location.Yes(Town), Gender.Boy, Financial.Condition Mid.
c)Adaptivity.Level High, Financial.Condition Rich.